Goto

Collaborating Authors

 image-level label


Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

Neural Information Processing Systems

A simple and effective way to improve long-tailed object detection (L TOD) is to use extra data to increase the training samples for tail classes. However, collecting bounding box annotations, especially for rare categories, is costly and tedious. Therefore, previous studies resort to datasets with image-level labels to enrich the amount of samples for rare classes by exploring image-level semantics (as shown in Figure 1 (a)). While appealing, directly learning from such data to benefit detection is challenging since they lack bounding box annotations that are essential for object detection.



Open-Vocabulary Object Detection via Language Hierarchy

Neural Information Processing Systems

Recent studies on generalizable object detection have attracted increasing attention with additional weak supervision from large-scale datasets with image-level labels.However, weakly-supervised detection learning often suffers from image-to-box label mismatch, i.e., image-levellabels do not convey precise object information.We design Language Hierarchical Self-training (LHST) that introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors.LHST expands the image-level labels with language hierarchy and enables co-regularization between the expanded labels and self-training. Specifically, the expanded labels regularize self-training by providing richer supervision and mitigating the image-to-box label mismatch, while self-training allows assessing and selecting the expanded labels according to the predicted reliability.In addition, we design language hierarchical prompt generation that introduces language hierarchy into prompt generation which helps bridge the vocabulary gaps between training and testing.Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets.


Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

Neural Information Processing Systems

A simple and effective way to improve long-tailed object detection (L TOD) is to use extra data to increase the training samples for tail classes. However, collecting bounding box annotations, especially for rare categories, is costly and tedious. Therefore, previous studies resort to datasets with image-level labels to enrich the amount of samples for rare classes by exploring image-level semantics (as shown in Figure 1 (a)). While appealing, directly learning from such data to benefit detection is challenging since they lack bounding box annotations that are essential for object detection.


Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation

Ma, Jiaqi, Xie, Guo-Sen, Zhao, Fang, Li, Zechao

arXiv.org Artificial Intelligence

Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2\% improvement on Pascal-5\textsuperscript{i} and a 9.7\% improvement on COCO-20\textsuperscript{i}. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.




Open-Vocabulary Object Detection via Language Hierarchy

Neural Information Processing Systems

Recent studies on generalizable object detection have attracted increasing attention with additional weak supervision from large-scale datasets with image-level labels.However, weakly-supervised detection learning often suffers from image-to-box label mismatch, i.e., image-levellabels do not convey precise object information.We design Language Hierarchical Self-training (LHST) that introduces language hierarchy into weakly-supervised detector training for learning more generalizable detectors.LHST expands the image-level labels with language hierarchy and enables co-regularization between the expanded labels and self-training. Specifically, the expanded labels regularize self-training by providing richer supervision and mitigating the image-to-box label mismatch, while self-training allows assessing and selecting the expanded labels according to the predicted reliability.In addition, we design language hierarchical prompt generation that introduces language hierarchy into prompt generation which helps bridge the vocabulary gaps between training and testing.Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets.


Completely Weakly Supervised Class-Incremental Learning for Semantic Segmentation

Kim, David Minkwan, Lee, Soeun, Kang, Byeongkeun

arXiv.org Artificial Intelligence

This work addresses the task of completely weakly supervised class-incremental learning for semantic segmentation to learn segmentation for both base and additional novel classes using only image-level labels. While class-incremental semantic segmentation (CISS) is crucial for handling diverse and newly emerging objects in the real world, traditional CISS methods require expensive pixel-level annotations for training. To overcome this limitation, partially weakly-supervised approaches have recently been proposed. However, to the best of our knowledge, this is the first work to introduce a completely weakly-supervised method for CISS. To achieve this, we propose to generate robust pseudo-labels by combining pseudo-labels from a localizer and a sequence of foundation models based on their uncertainty. Moreover, to mitigate catastrophic forgetting, we introduce an exemplar-guided data augmentation method that generates diverse images containing both previous and novel classes with guidance. Finally, we conduct experiments in three common experimental settings: 15-5 VOC, 10-10 VOC, and COCO-to-VOC, and in two scenarios: disjoint and overlap. The experimental results demonstrate that our completely weakly supervised method outperforms even partially weakly supervised methods in the 15-5 VOC and 10-10 VOC settings while achieving competitive accuracy in the COCO-to-VOC setting. Introduction Class-incremental semantic segmentation (CISS) has become a vital research topic in the computer vision and robotics communities [30, 2, 4], as it enables learning to segment objects of novel classes in addition to previously learned categories using newly provided data.


Enhancing Weakly Supervised Semantic Segmentation for Fibrosis via Controllable Image Generation

Yue, Zhiling, Fang, Yingying, Yang, Liutao, Baid, Nikhil, Walsh, Simon, Yang, Guang

arXiv.org Artificial Intelligence

Fibrotic Lung Disease (FLD) is a severe condition marked by lung stiffening and scarring, leading to respiratory decline. High-resolution computed tomography (HRCT) is critical for diagnosing and monitoring FLD; however, fibrosis appears as irregular, diffuse patterns with unclear boundaries, leading to high inter-observer variability and time-intensive manual annotation. To tackle this challenge, we propose DiffSeg, a novel weakly supervised semantic segmentation (WSSS) method that uses image-level annotations to generate pixel-level fibrosis segmentation, reducing the need for fine-grained manual labeling. Additionally, our DiffSeg incorporates a diffusion-based generative model to synthesize HRCT images with different levels of fibrosis from healthy slices, enabling the generation of the fibrosis-injected slices and their paired fibrosis location. Experiments indicate that our method significantly improves the accuracy of pseudo masks generated by existing WSSS methods, greatly reducing the complexity of manual labeling and enhancing the consistency of the generated masks.